 spectral data


Self-supervised and Multi-fidelity Learning for Extended Predictive Soil Spectroscopy

Sun, Luning, Safanelli, José L., Sanderman, Jonathan, Georgiou, Katerina, Brungard, Colby, Grover, Kanchan, Hopkins, Bryan G., Liu, Shusen, Bremer, Timo

arXiv.org Artificial Intelligence

We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained with the large MIR spectral library and the Variational Autoencoder algorithm to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also froze the trained MIR decoder and reused it for a spectrum conversion task by plugging it into a NIR encoder to learn the mapping between NIR and MIR spectra, in an attempt to leverage the predictive capabilities contained in the large MIR library with a low-cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map between original spectra, predicted spectra, and latent space embeddings for nine soil properties. Performance was evaluated independently of the KSSL training data using a gold-standard test set, along with regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil property prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to the predictive performance of NIR-only models, suggesting the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.


GAMMA_FLOW: Guided Analysis of Multi-label spectra by MAtrix Factorization for Lightweight Operational Workflows

Rädle, Viola, Hartwig, Tilman, Oesen, Benjamin, Kröger, Emily Alice, Vogt, Julius, Gericke, Eike, Baron, Martin

arXiv.org Artificial Intelligence

GAMMA_FLOW is an open-source Python package for real-time analysis of spectral data. It supports classification, denoising, decomposition, and outlier detection of both single- and multi-component spectra. Instead of relying on large, computationally intensive models, it employs a supervised approach to non-negative matrix factorization (NMF) for dimensionality reduction. This ensures a fast, efficient, and adaptable analysis while reducing computational costs. GAMMA_FLOW achieves classification accuracies above 90% and enables reliable automated spectral interpretation. Originally developed for gamma-ray spectra, it is applicable to any type of one-dimensional spectral data. As an open and flexible alternative to proprietary software, it supports various applications in research and industry.
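
One simple reading of "supervised NMF" is that the basis spectra are fixed in advance from labeled templates, and only non-negative mixing weights are fitted for each measured spectrum. The sketch below illustrates that idea with standard multiplicative updates in plain Python. It is not the GAMMA_FLOW API: the function names, the template values, and the rule of labeling by the largest weight are all assumptions made for illustration.

```python
def fit_weights(spectrum, basis, n_iter=500, eps=1e-12):
    """Fit non-negative weights h so that sum_k h[k] * basis[k] ~ spectrum.

    Uses the multiplicative update h <- h * (W^T x) / (W^T W h), which keeps
    every weight non-negative as long as the inputs are non-negative.
    """
    k = len(basis)
    h = [1.0] * k
    # Precompute W^T x and W^T W, where the rows of `basis` are the templates.
    wtx = [sum(b * s for b, s in zip(basis[i], spectrum)) for i in range(k)]
    wtw = [[sum(a * b for a, b in zip(basis[i], basis[j])) for j in range(k)]
           for i in range(k)]
    for _ in range(n_iter):
        h = [h[i] * wtx[i] / (sum(wtw[i][j] * h[j] for j in range(k)) + eps)
             for i in range(k)]
    return h


def classify(spectrum, basis, labels):
    """Label a spectrum by its dominant component (hypothetical decision rule)."""
    h = fit_weights(spectrum, basis)
    best = max(range(len(h)), key=h.__getitem__)
    return labels[best], h


# Two made-up template spectra and a mixture of 2.0 * first + 0.5 * second.
templates = [[1.0, 0.5, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
mixture = [2.0, 2.5, 4.0, 0.5]
label, weights = classify(mixture, templates, ["isotope_A", "isotope_B"])
```

Because the weights are compared directly, this decision rule only makes sense when the templates are normalized to comparable scales.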


CNN-BiLSTM for sustainable and non-invasive COVID-19 detection via salivary ATR-FTIR spectroscopy

Junior, Anisio P. Santos, Sabino-Silva, Robinson, Martins, Mário Machado, Cunha, Thulio Marquez, Carneiro, Murillo G.

arXiv.org Artificial Intelligence

The COVID-19 pandemic has placed unprecedented strain on healthcare systems and remains a global health concern, especially with the emergence of new variants. Although real-time polymerase chain reaction (RT-PCR) is considered the gold standard for COVID-19 detection, it is expensive, time-consuming, labor-intensive, and sensitive to issues with RNA extraction. In this context, ATR-FTIR spectroscopy analysis of biofluids offers a reagent-free, cost-effective alternative for COVID-19 detection. We propose a novel architecture that combines Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (BiLSTM) networks, referred to as CNN-BiLSTM, to process spectra generated by ATR-FTIR spectroscopy and diagnose COVID-19 from spectral samples. We compare the performance of this architecture against a standalone CNN and other state-of-the-art machine learning techniques. Experimental results demonstrate that our CNN-BiLSTM model outperforms all other models, achieving an average accuracy and F1-score of 0.80 on a challenging real-world COVID-19 dataset. The addition of the BiLSTM layer to the CNN architecture significantly enhances model performance, making CNN-BiLSTM a more accurate and reliable choice for detecting COVID-19 using ATR-FTIR spectra of non-invasive saliva samples.


Sky Background Building of Multi-objective Fiber spectra Based on Mutual Information Network

Zhang, Hui, Cai, Jianghui, Yang, Haifeng, Luo, Ali, Yang, Yuqing, Kong, Xiao, Ding, Zhichao, Zhou, Lichan, Han, Qin

arXiv.org Artificial Intelligence

Sky background subtraction is a critical step in the processing of multi-objective fiber spectra. However, current subtraction relies mainly on sky fiber spectra to build a Super Sky. These average spectra fail to model the environment surrounding the objects. To address this issue, we propose a sky background estimation model, Sky background building based on Mutual Information (SMI). SMI is based on mutual information and an incremental training approach. It utilizes spectra from all fibers in the plate to estimate the sky background. SMI contains two main networks. The first network applies a wavelength calibration module to extract sky features from spectra and can effectively solve the feature shift problem according to the corresponding emission position. The second network employs an incremental training approach to maximize mutual information between representations of different spectra to capture the common component. Then, it minimizes the mutual information between adjoining spectra representations to obtain individual components. This network yields an individual sky background at each location of the object. To verify the effectiveness of the method, we conducted experiments on LAMOST spectra. Results show that SMI can obtain a better object sky background during the observation, especially at the blue end.
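
For intuition about the quantity being maximized and minimized, the sketch below computes a histogram-based plug-in estimate of the mutual information between two sampled signals. The abstract does not say which estimator SMI uses, and neural approaches typically optimize a learned bound instead, so the binning scheme and function name here are assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X; Y) in nats from two equal-length sequences."""

    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant signals
        return [min(int((v - lo) / width), bins - 1) for v in values]

    dx, dy = discretize(x), discretize(y)
    n = len(dx)
    joint = Counter(zip(dx, dy))
    px, py = Counter(dx), Counter(dy)
    # Sum p(i, j) * log( p(i, j) / (p(i) * p(j)) ) over occupied cells.
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
flat = [1.0] * 8
mi_self = mutual_information(signal, signal)  # equals the entropy of the bins
mi_flat = mutual_information(signal, flat)    # a constant carries no information
```

Two spectra that share a strong common component score high under this measure, while a signal paired with a constant scores zero.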


Generating Light-based Fingerprints for Indoor Localization

Lee, Hsun-Yu, Lin, Jie, Wu, Fang-Jing

arXiv.org Artificial Intelligence

Radio-frequency solutions (e.g., Wi-Fi, RFID, UWB) are widely adopted but remain vulnerable to multipath fading, interference, and uncontrollable coverage variation. We explore an orthogonal modality, visible light communication (VLC), and demonstrate that the spectral signatures captured by a low-cost AS7341 sensor can serve as robust location fingerprints. We introduce a two-stage framework that (i) trains a multi-layer perceptron (MLP) on real spectral measurements and (ii) enlarges the training corpus with synthetic samples produced by TabGAN. The augmented dataset reduces the mean localization error from 62.9 cm to 49.3 cm (a 20% improvement) while requiring only 5% additional data-collection effort. Experimental results obtained on 42 reference points in a U-shaped laboratory confirm that GAN-based augmentation mitigates data-scarcity issues and enhances generalization.


Detection of Adulteration in Coconut Milk using Infrared Spectroscopy and Machine Learning

Al-Awadhi, Mokhtar A., Deshmukh, Ratnadeep R.

arXiv.org Artificial Intelligence

In this paper, we propose a system for detecting adulteration in coconut milk using infrared spectroscopy. The proposed machine learning-based system comprises three phases: preprocessing, feature extraction, and classification. The first phase involves removing irrelevant data from coconut milk spectral signals. In the second phase, we employ the Linear Discriminant Analysis (LDA) algorithm to extract the most discriminating features. In the third phase, we use the K-Nearest Neighbor (KNN) model to classify coconut milk samples as authentic or adulterated. We evaluate the performance of the proposed system using a public dataset comprising Fourier Transform Infrared (FTIR) spectral information of pure and contaminated coconut milk samples. Findings show that the proposed method successfully detects adulteration with a cross-validation accuracy of 93.33%.
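
The third phase can be sketched as a majority vote among the nearest training samples. The example below covers only the KNN step and assumes the features have already been preprocessed and projected by LDA (a two-class LDA yields a single discriminant axis, hence the one-dimensional points). The feature values, the choice of k, and the function name are made up for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical LDA-projected features: one value per sample.
features = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
labels = ["authentic"] * 3 + ["adulterated"] * 3

pred_low = knn_predict(features, labels, (0.05,))
pred_high = knn_predict(features, labels, (1.15,))
```

An odd k avoids tied votes in the two-class case.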


Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Zhuang, Xiang, Wu, Bin, Cui, Jiyu, Feng, Kehua, Li, Xiaotong, Xing, Huabin, Ding, Keyan, Zhang, Qiang, Chen, Huajun

arXiv.org Artificial Intelligence

Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.
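
Monte Carlo Tree Search balances exploiting candidates that have scored well against exploring ones that have rarely been tried. The sketch below shows the widely used UCT selection rule for that trade-off. The abstract does not state which selection rule K-MSE uses, so this is an assumption. The node fields and the exploration constant are hypothetical, and in a setup like the one described the reward would come from the molecule-spectrum scorer.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Upper confidence bound for trees: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child node with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits),
    )


# Hypothetical candidate structures with accumulated scorer rewards.
candidates = [
    {"name": "candidate_a", "reward": 5.0, "visits": 10},
    {"name": "candidate_b", "reward": 1.0, "visits": 1},
]
chosen = select_child(candidates, parent_visits=11)
```

Here the rarely visited candidate wins because its exploration bonus outweighs the other candidate's visit count.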


Neural Integral Operators for Inverse problems in Spectroscopy

Zappala, Emanuele, Giola, Alice, Kramer, Andreas, Greco, Enrico

arXiv.org Artificial Intelligence

Deep learning has shown high performance on spectroscopic inverse problems when sufficient data is available. However, data in spectroscopy is often scarce, which usually causes severe overfitting with deep learning methods. Traditional machine learning methods are viable when datasets are smaller, but the accuracy and applicability of these methods are generally more limited. We introduce a deep learning method for classification of molecular spectra based on learning integral operators via integral equations of the first kind, which results in an algorithm that is less affected by overfitting on small datasets than other deep learning models. The problem formulation of the deep learning approach is based on inverse problems, which have traditionally found important applications in spectroscopy. We perform experiments on real-world data to showcase our algorithm. The model outperforms traditional machine learning approaches such as decision trees and support vector machines, and for small datasets it outperforms other deep learning models. Therefore, our methodology leverages the power of deep learning while maintaining performance when the available data is very limited, which is one of the main issues that deep learning faces in spectroscopy, where datasets are often small.
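
For reference, an integral equation of the first kind has the general form below. The symbols are generic textbook notation, not the paper's own: g is the observed function, f is the unknown to be recovered, and K is the kernel, which an approach like the one described would parameterize with a neural network.

```latex
g(s) = \int_a^b K(s, t)\, f(t)\, \mathrm{d}t
```

The unknown f appears only under the integral sign, which is what distinguishes the first kind from the second kind and what makes recovering f from g an inverse problem, typically an ill-posed one.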


A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders

Ren, Pengju, Zhou, Ri-gui, Li, Yaochong

arXiv.org Artificial Intelligence

Raman spectroscopy serves as a powerful and reliable tool for analyzing the chemical information of substances. The integration of Raman spectroscopy with deep learning methods enables rapid qualitative and quantitative analysis of materials. Most existing approaches adopt supervised learning methods. Although supervised learning has achieved satisfactory accuracy in spectral analysis, it is still constrained by costly and limited well-annotated spectral datasets for training. When spectral annotation is challenging or the amount of annotated data is insufficient, the performance of supervised learning in spectral material identification declines. In order to address the challenge of feature extraction from unannotated spectra, we propose a self-supervised learning paradigm for Raman Spectroscopy based on a Masked AutoEncoder, termed SMAE. SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features. The reconstructed spectra exhibit certain denoising properties, improving the signal-to-noise ratio (SNR) by more than twofold. Utilizing the network weights obtained from masked pre-training, SMAE achieves clustering accuracy of over 80% for 30 classes of isolated bacteria in a pathogenic bacterial dataset, demonstrating significant improvements compared to classical unsupervised methods and other state-of-the-art deep clustering methods. After fine-tuning the network with a limited amount of annotated data, SMAE achieves an identification accuracy of 83.90% on the test set, presenting competitive performance against the supervised ResNet (83.40%).
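
The masking step of a masked autoencoder can be sketched as follows: the spectrum is split into fixed-size patches, a random subset is hidden, and the model is then trained to reconstruct the hidden values. The patch size, masking ratio, and zero fill used here are assumptions for illustration. The abstract does not give SMAE's actual settings, and the encoder and decoder are omitted.

```python
import random


def mask_spectrum(spectrum, patch_size=4, mask_ratio=0.5, seed=None):
    """Zero out a random subset of patches.

    Returns the masked copy and a boolean mask marking the hidden positions,
    which is where a reconstruction loss would be evaluated.
    """
    rng = random.Random(seed)
    n_patches = len(spectrum) // patch_size
    n_masked = int(n_patches * mask_ratio)
    masked_ids = rng.sample(range(n_patches), n_masked)

    masked = list(spectrum)
    mask = [False] * len(spectrum)
    for p in masked_ids:
        for i in range(p * patch_size, (p + 1) * patch_size):
            masked[i] = 0.0
            mask[i] = True
    return masked, mask


# A made-up 16-point spectrum with strictly positive intensities.
spectrum = [float(i + 1) for i in range(16)]
masked, mask = mask_spectrum(spectrum, patch_size=4, mask_ratio=0.5, seed=0)
```

Masking whole patches rather than single points prevents the model from trivially interpolating between neighboring values.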


Artificial Intelligence in Spectroscopy: Advancing Chemistry from Prediction to Generation and Beyond

Guo, Kehan, Shen, Yili, Gonzalez-Montiel, Gisela Abigail, Huang, Yue, Zhou, Yujun, Surve, Mihir, Guo, Zhichun, Das, Prayel, Chawla, Nitesh V, Wiest, Olaf, Zhang, Xiangliang

arXiv.org Artificial Intelligence

The rapid advent of machine learning (ML) and artificial intelligence (AI) has catalyzed major transformations in chemistry, yet the application of these methods to spectroscopic and spectrometric data, referred to as Spectroscopy Machine Learning (SpectraML), remains relatively underexplored. Modern spectroscopic techniques (MS, NMR, IR, Raman, UV-Vis) generate an ever-growing volume of high-dimensional data, creating a pressing need for automated and intelligent analysis beyond traditional expert-based workflows. In this survey, we provide a unified review of SpectraML, systematically examining state-of-the-art approaches for both forward tasks (molecule-to-spectrum prediction) and inverse tasks (spectrum-to-molecule inference). We trace the historical evolution of ML in spectroscopy, from early pattern recognition to the latest foundation models capable of advanced reasoning, and offer a taxonomy of representative neural architectures, including graph-based and transformer-based methods. Addressing key challenges such as data quality, multimodal integration, and computational scalability, we highlight emerging directions such as synthetic data generation, large-scale pretraining, and few- or zero-shot learning. To foster reproducible research, we also release an open-source repository containing recent papers and their corresponding curated datasets (https://github.com/MINE-Lab-ND/SpectrumML_Survey_Papers). Our survey serves as a roadmap for researchers, guiding progress at the intersection of spectroscopy and AI.